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Abstract 

In this paper we introduce an algebraic recursive multilevel incomplete 
factorization preconditioner, based on a distributed Schur complement 
formulation, for solving general linear systems. The novelty of 
the proposed method is to combine factorization techniques of both 
implicit and explicit type, recursive combinatorial algorithms, multilevel 
mechanisms and overlapping strategies to maximize sparsity in the inverse 
factors and consequently reduce the factorization costs. Numerical 
experiments demonstrate the good potential of the proposed solver to 
precondition effectively general linear systems, also against other state-of- 
the-art iterative solvers of both implicit and explicit form. 

Keywords: linear systems; iterative solvers; preconditioners; sparse 
approximate inverse methods; multilevel reordering algorithms. 


1 Introduction 

Krylov subspace methods may be considered the method of choice for solving 
large and sparse systems of linear equations arising from the discretization of 
(systems of) partial differential equations on modern parallel computers. This 
class of algorithms are iterative in nature. At every step k, they compute the 
approximate solution Xk of a linear system Ax = b from the Krylov subspace of 
dimension k 

Kk{A, b) = span{b, Ab, A^b ,..., A^~^b}, 

according to different criteria for each given method. The computation requires 
matrix-vector products with the coefficient matrix A plus vector operations, 
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thus potentially reducing the cumbersome costs of sparse direct solvers on large 
problems, especially in terms of memory. All of the iterative Krylov methods 
converge rapidly if A is somehow close to the identity. Therefore, it is common 
replacing the original system Ax = b by 


M-^Ax = M-^b, 

(1) 

1 

11 

II 

1 

(2) 


for a nonsingular matrix M ra A. Systems Q and ([^ are referred to as left and 
right preconditioned systems, respectively, and M as the preconditioner matrix. 
In the case M is factorized as the product of two sparse matrices, M = Mi M2, 
like in the Hermitian and positive definite case, one might solve the modified 
linear system 

= M^^b, X = ( 3 ) 

If one may choose M so that M~^A, AM~^ or approximate the 

identity, and linear systems with M or with Mi and M2 as coefficient matrices 
are easy to invert, it is more efficient to apply a Krylov subspace method to the 
modified linear system. 

Optimal analytic preconditioners based on low order discretizations, nearby 
equations that are simple to solve, or similar ideas have been proposed in 
the literature for specific problems. However, the problem-specific approach is 
generally sensitive to the characteristics of the underlying operator and to the 
details of the geometry. In this study, we pursue an algebraic approach where the 
preconditioner M is computed only from the coefficient matrix A. Although not 
optimal for any specific problem, algebraic methods are universally applicable, 
they can be adapted to different operators and to changes in the geometry 
by tuning a few parameters, and are well suited for solving irregular problems 
defined on unstructured grids. 

Roughly speaking, most of the existing techniques can be divided into 
either implicit or explicit form. A preconditioner of implicit form is defined 
by any nonsingular matrix M « A, and requires to solve an extra linear 
system with M at each step of an iterative method. The most important 
example in this class is represented by the Incomplete LU decomposition 
(ILU), where M is implicitly defined as M = LU, L and U being triangular 
matrices that approximate the exact L and U factors of A according to a 
prescribed dropping strategy adopted during the Gaussian elimination process. 
These methods are considered amongst the most reliable in a general setting. 
Well known theoretical results on the existence and the stability of the 
factorization can be proved for the class of M-matrices [ 3 S], and recent studies 
are involving more general matrices, both structured and unstructured. The 
quality of the factorization on difficult problems can be enhanced by using 
several techniques such as reordering, scaling, diagonal shifting, pivoting and 
condition estimators (see e.g. [M SI 13113 IS] )■ As a result of this active 
development, in the last years successful results are reported with ILU-type 
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preconditioners in many areas that were of exclusive domain of direct solution 
methods like in circuits simulation, power system networks, chemical engineering 
plants modelling, graphs and other problems not governed by PDEs, or in 
areas where direct methods have been traditionally preferred, like in structural 
analysis, semiconductor device modelling and computational fluid dynamics 
applications (see e.g. mi n a [Ml sg). One problem with ILU-techniques 
is the severe degradation of performance observed on vector, parallel and GPUs 
machines, mainly due to the sparse triangular solves [33]. In some cases, 
reordering techniques may help to introduce nontrivial parallelism. However, 
parallel orderings may sometimes degrade the convergence rate, while more fill- 
in diminishes the overall parallelism of the solver m- 

Explicit preconditioning tries to mitigate such difficulties by approximating 
directly A~^, as the product M of sparse matrices, so that the preconditioning 
operation reduces to forming one (or more) sparse matrix-vector product, and 
consequently the application of the preconditioner may be easier to parallelize 
and numerically stable. Some methods can also perform the construction phase 
in parallel jM] [TUI [Ml [SZl [35] ; additionally, on certain indefinite problems with 
large nonsymmetric parts, the explicit approach can provide better results than 
ILU techniques (see e.g. [HI |8l El])- In practice, however, some questions 
need to be addressed. The computed matrix M could be singular, and the 
construction cost is typically much higher than for ILU-type methods, especially 
for sequential runs. The main issue is the selection of the non-zero pattern of 
M. The idea is to keep M reasonably sparse while trying to capture the ‘large’ 
entries of the inverse, which are expected to contribute the most to the quality 
of the preconditioner. On general problems it is difficult to determine the best 
structure for M in advance, and the computational and storage costs required 
to achieve the same rate of convergence of preconditioners given in implicit form 
may be prohibitive in practice. 

In this study, we present an algebraic multilevel solver for preconditioning 
general nonsymmetric linear systems which attempts to combine characteristics 
of both approaches. Assuming that the matrix A admits the factorization 
A = LU, with L a unit lower and U an upper triangular matrix, our method 
approximates the inverse factors L~^ and U~^. Sparsity in the approximate 
inverse factors is maximized by employing recursive combinatorial algorithms. 
Robustness is enhanced by combining the factorization with recently developed 
overlapping strategies and by using efficient local solvers. 

The paper is organized as follows. In Section [^ we describe the proposed 
multilevel preconditioner. In Section [^ we show how to combine our 
preconditioner with overlapping strategies, and in Section[^we assess its overall 
performance by showing several numerical experiments on realistic matrix 
problems, also against other state-of-the-art solvers. Finally, in Section [^ we 
conclude the study with some remarks and perspectives for future work. 
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2 The AMES solver 

Let 


Ax = b (4) 

be a n X n general linear system with nonsingular, possibly indefinite and 
nonsymmetric matrix A = {a^} S and vectors x,b G K". We assume 

that A admits for a triangular decomposition 

A = LU 


and we precondition system Q as 

M^AMi/y = M^b 

with Mi^ « L~^ and Mjj Ri U~^, clearly preserving symmetry and/or positive 
definiteness of A. This approach of preconditioning linear systems has been 
extensively investigated in a series of papers by Kolotilina and Yeremin 
EaEllEI], who prescribed the nonzero pattern of the inverse factors and 
Mu of A in advance equal to the pattern of the lower and upper triangular part 
of A + , respectively, and determined the entries of and Mj/ explicitly 

by solving linear equations involving the principal submatrices of A {the ‘FSAF 
preconditioner). Chow suggested to use as pattern for the inverse factors the 
structure of the lower and upper triangular part of {A + where p is a 

positive integer [HllalllS]. The larger p, in general the higher the quality of the 
computed preconditioner, although the construction, storage and application 
costs tend to increase rapidly with p. Blocking and adaptive strategies have 
been recently studied to overcome these problems [IHl lUl US. Benzi and Tuma 
proposed to compute the entries of matrices Ml and Mu by means of a (two- 
sided) Gram-Schmidt orthogonalization process with respect to the bilinear form 
associated with A, and to determine the best structure for the inverse factors 
dynamically, during the construction {the AINV’ preconditioner). Sparsity is 
preserved in the process by discarding elements having magnitude smaller than 
a given positive threshold nil]- 

In this study we analyse multilevel mechanisms, recursive combinatorial 
algorithms and overlapping techniques, combined with efficient local solvers, to 
enhance robustness and reduce costs for the approximation of the inverse factors. 
We refer to the resulting preconditioner as AMES (Algebraic Multilevel Explicit 
Solver). It is easier to describe the AMES method by using graph notation, 
dividing the solution of system Q in five distinct phases: 

1. a scale phase, where the coefficient matrix A is scaled by rows and columns 
so that the largest entry of the scaled matrix has magnitude smaller than 
one; 

2. a preorder phase, where the structure of A is used to compute a suitable 
ordering that maximizes sparsity in the approximate inverse factors; 
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3. an analysis phase, where the sparsity preserving ordering is analyzed and 
an efficient data structure is generated for the factorization; 

4. a factorization phase, where the nonzero entries of the preconditioner are 
computed; 

5. a solve phase, where all the data structures are accessed for solving the 
linear system. 

Below we describe each phase separately. 


2.1 Scale phase. 

We initially scale system Q by rows and columns as 

Dl^'^Ay = y = D^^x, (5) 

where the n x n diagonal scaling matrices Di and D 2 have the form 


m.ax|aij | 


Di{i,j) = 


if i = j 


if i ^ j 


max|aij | 




, if i = j 


, if i 7^ j 


For simplicity, we still refer in this paper to the scaled system ([^ as Ax = b. 


2.2 Preorder phase. 

We use standard notation of graph theory to describe this computational step. 
We denote by 12 (A) the undirected graph associated with the matrix 

^ \ A, if A is symmetric, 

1 >1 + A"’", if A is nonsymmetric. 


First, fl(A) is partitioned into p non-overlapping subgraphs 12^ of roughly equal 
size by using the multilevel graph partitioning algorithms available in the Metis 
package [35]. For each partition 12^ we distinguish two disjoint sets of nodes (or 
vertices): interior nodes that are connected only to nodes in the same partition, 
and interface nodes that straddle between two different partitions; the set of 
interior nodes of 12, form a so called separable or independent cluster. Upon 
renumbering the vertices of 12 one cluster after another, followed by the interface 
nodes as last, and permuting A according to this new ordering, a block bordered 
linear system is obtained, with coefficient matrix of the form 


A = P^AP = 


B F 
E C 


( 


B 2 


y El E2 


El \ 
F 2 


Bp Fp 

Ep C J 


( 6 ) 
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In each diagonal block Bi corresponds to the interior nodes of fli, and 
the blocks Ei and Fi correspond to the interface nodes of 17^; the block C 
is associated to the mutual interactions between the interface nodes. In our 
multilevel scheme we apply the same block downward arrow structure to the 
diagonal blocks Bi of A; the procedure is repeated recursively until a maximum 
number of levels is reached, or until the blocks at the last level are sufficiently 
small to be easily factorized. As an example, in Figure [^b) we show the 
structure of the sparse matrix rdb2048 from Tim Davis matrix collection m 
after three reordering levels. 

To reduce factorization costs, a similar permutation is applied to the Schur 
complement matrix S = C — EB~ 

( Bsi Fsi \ 


S = 


\ Esi 


^F as 

follows 

Esi 

Bs2 


Es2 


Bsp 

Esp 

Es2 ■ 

■ ■ Esp 

Cs 


(7) 


2.3 Analysis phase. 

In the analysis phase, a suitable data structure for storing the linear system is 
defined, allocated and initialized. We use a tree structure to store the block 
bordered form ([^ of A. The root is the whole graph 17, and the leaves at 
each level are the independent clusters of each subgraph. Each node of the 
tree corresponds to one partition 17^ of 17(A), or equivalently to one block Bi 
of matrix A. The information stored at each node are the entries of the off- 
diagonal blocks E and F of B^s father, and those of the block C of Bi after 
its permutation, except at the last level of the tree where we store the entire 
block Bi- All these matrices are represented in the computer memory using a 
compressed sparse row storage format, except for blocks Fi that are stored in 
compressed sparse column format. Blocks Ei and Fi can be very sparse; many 
of their rows and columns can be zero, and this leads to a significant saving of 
computation. 


2.4 Factorization phase. 


The approximate inverse 

factors L ^ and U ^ of A v 


/ 

IFi \ 




W 2 








C/-1 W, 



V 

Us^ y 



--1 


L 


-1 




where 


B, = = -U-^L-^F,Us\G, = -Lg^E,U-^E 


V Gi G 2 ■■■ 

Gp 

(8) 

-Ls^E,U-^L-^ 

(9) 


--1 
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(a) The original structure of 
the rdb204S matrix. 


(b) The structure of rdb2048 
after permutation. 


(c) The structure of the 
inverse factor. In red 

are displayed the entries 
actually stored. 


Figure 1: Structure of the multilevel inverse-based factorization for the matrix 
rdh2048. 


and Lsi Us are the triangular factors of the Schur complement matrix 


S = C-j2E^B-^F,. 


( 10 ) 


i=l 


Some fill-in may occur in L~^ and U~^ during the factorization, but only 
within the nonzero blocks. This two-level reordering scheme was used in the 
context of factorised approximate inverse methods for the parallelization of 
the AINV preconditioner in [2]. Differently from [2], we apply the arrow 
structure recursively to the diagonal blocks and to the first level Schur 
complement as well, to gain additional sparsity. The multilevel factorization 
algorithm requires to invert only the last level blocks and the small Schur 
complements at each reordering level; the blocks Wi, Gi do not need to be 
assembled explicitly, as they may be applied using Eqn ([^. For the rdb2048 
problem, in Figure[^c) we display in red the actual extra storage required by the 
exact multilevel inverse factorization in addition to matrix A; these represent 
only 34% of the total nonzeros of A. From the knowledge of the red entries, the 
blue ones can be retrieved from Eqn ([^, using the off-diagonal blocks of A. We 
also permute the large Schur complement at the first level into a block bordered 
structure, until we reach a maximal number of levels or a given minimal size. 
The last-level matrix is inverted inexactly. An inexact solver is also used to 


factorize the last-level blocks Bi in (101. 


2.5 Solve phase. 

In the solve phase, the multilevel factorization is applied at every iteration 
step of a Krylov method for solving the linear system. Notice that the inverse 
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factorization of A may be written as 


(PAP^) ^ 


C/-1 W \ f L-i 0 \ 
0 t /5 V H G ) 


( 11 ) 


where W = —U~^L~^FUg^, G = —Lg^EU~^L~^, and Lg, Us are the inverse 
factors of the Schur complement matrix S = C — EB~^F. 

From Eqn. 0. we obtain the following expression for the exact inverse 

/ b-^ + b-^fs-^eb-^ -b-^fs-^ \ 

-S-^EB-^ 5-1 ) ■ 


We can derive preconditioners from Eqn. (12) by computing approximate solvers 
B~^ for B and S~^ for S. Hence the preconditioner matrix M will have the 
form 

B-'^ + B-^FS-^EB-^ -B-^FS-^ 

-S-^EB-^ S-^ 


M = 


and the preconditioning operation 


yi 

y2 


= M 


Xi 

X2 


writes as Algorithm 


Notice that Algorithm is called recursively at lines 1-3, as B and S also have 
a block bordered structure upon permutation. 


Algorithm 1 The preconditioning operation in the AMES solver. 

1: Pi = B~^Xi 

2 : [P2,P^] = S~'^[E ■ pi,X2\ 

3 : [pi,P5\= B-^[F ■P2,F ■P'i] 

4: yi=Pl+PA- P5 
5: 2/2 = P3 - P2 


3 Combining the AMES solver with overlapping 

In [23, Grigori, Nataf and Qu have introduced an overlapping technique 
to enhance the robustness of multilevel incomplete LU factorization 
preconditioning computed from matrices reordered in arrow form, e.g. using 
the nested dissection method by George m- The multilevel mechanism 
incorporated in the AMES preconditioner described in the previous section is 
based on a nested dissection-like ordering, and thus it can easily accomodate 
for overlapping. We have tested this idea in our numerical experiments, and 
in this section we shortly describe the procedure adopted. The results of our 
experiments are reported in Section]^ 













3.1 Background 

Let = {V{ft),E{^l)) be the graph of A, k(ri) denoting the set of vertices 
and E{Q) the set of edges in 12. If the graph is directed, we denote an edge 
of E issuing from vertex u to vertex v as (m, u); u is called a predecessor of v, 
and V a successor of u. If the graph is undirected, we denote the edges of E by 
non-ordered pairs {m, u}; u is called a neighbour of v. As in the previous section, 
we assume that 12 is partitioned into p independent non-overlapping subgraphs 
r2i, ..., Up, and we call S the set of separator nodes, straddling between two 
different partitions. Goal of overlapping is to extend each independent set of 
12 by including its direct neighbours, similarly to the overlapping idea used in 
other domain decomposition methods, for example in the restricted additive 
Schwarz method [39l |40] . 

Following [20], we denote by V{V,i^ext) the separator nodes that are 
successors of 12,, 

V{n,,ext) = {vG k(5)|37r G k(12,), {u,v) G E{n)} C V{S), (13) 

and by V{ftext) the complete set of successor nodes of all the subdomains 

k(12e.i) = u )■ (14) 

i—l\p 

Then 12^ is extended to the set 12^ as 

V{Cl,) = k(0,) U ^(12,,^.*), I = 1, ... ,P, (15) 

and the separator S is extended to S by adding the successors of nodes in 
V{ilext), that is 

V{S) = V{S) U {u G k(12,), i = l,...,p\3uG {u, v) G A(12)}. (16) 

Using this notation, the overlapped graph of A, 12 = (U(12), if(12)), is 
introduced as follows. First define the overlapped subgraph 12^ and the 
overlapped separator S as, respectively, 

U(12,) = {(a:,i) : x G 12,}, 

V{S) = {(a:,s) : x G S}. 

For simplicity we refer to {x, i) as Xi. Then the vertex set V (12) of the overlapped 
graph 12 is formed by the disjoint union of the U(12i)’s and of U(S') as 

U(12) = I U U(4)) UU(^). (17) 

\iGl:p / 

Recall that, given the union B of a family of sets indexed by the index set I, 

B ^[jAi = |J{x : X G Ai}, 
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their disjoint union C is defined as the set 


C = |J{(a;,*) : x G Ai}. 
i&I 


At this stage, it is useful to introduce the two projection operators Hi and 112 
such that 

Hi : (x, i) I—>■ X 


and 


112 : {x, i) I—>■ i. 

With this notation, the set of edges of the overlapped graph is defined 
according to their projection onto the original graph as follows 


E(Sl,) 


{{ui,Vi)\ui G v{hi),Vi G v[hi), (ni(Mi),ni(ui)) g E{n)}, 


(18) 


E{S) = {{Us,Vs)\Us G 


ViS),Vs G y(5),(ni(u,),ni(u,)) G A(r!)}, 


(19) 


m,5) 


{{Ui,Vs)\Ui G V{ni),Vs 


GV{S),{Tliiu,),IlM) G E{n), 


$Vi G V{Cli), {ui,Vi) G E{n^)}, 


( 20 ) 


E{s,Cii) = {(us,uj)|us G v{s),v^ G y(fij), (ni(us),ni(ui)) g E{ii), 


$VsGV{S),{us,Vs)gE{S)}. (21) 

The following property, established in [5D], ensures an equivalence between 
the equations of the overlapped system and those of the original system. 

Property 1 Let LI be the associated directed graph of a given system of linear 
equations and u be a vertex ofV{Ll). Let be the overlapped graph, and let Ui 
be a vertex ofV{Ll) such that ni(ui) = u G 1^(11). For each edge (u,v) G E{Ll), 
there is a unique Vj G 1/(12) such that we have both ni(uj) = v and (ui,Vj) G 

E{n). 

This property shows that there exists a bijection between the nonzeros of 
the equation corresponding to vertex u in the original system and the nonzeros 
of the equation corresponding to its dual Ui, where ni(ui) = u. From a matrix 
viewpoint, to each nonzero entry dui,vi in the overlapped matrix there is a unique 
nonzero entry in the original matrix, where ni(Mi) = u and ni(ui) = v. 
Therefore there is a one-to-one correspondence between equations in the original 
system and those in the overlapped system. By solving the overlapped system, 
we can automatically obtain the solution of the original system. 
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3.2 Example 

We display a simple example from |20] to describe shortly how the 
overlapping procedure works in practice. We consider a 5 x 5 matrix having 
the structure shown in Figure [^a). The graph consists of two independent 
subgraphs = {1,2}, 112 = {3} and one separator S = {4,5}. We just pick 
the first subgraph and the separator set to explain. Separator nodes that are 
successors of Hi are the set 


E(lll,e.t) = {4,5} 

and we have 

= l/(0)i U = {1, 2,4,5}, 

so that 

= {li,2i,4i,5i}. 

Analogously, 

E(ll2.e.t) = {4,5} 

and 

EU ^2,ext — {4, 5}. 

Next, we compute the overlapped separator set S. The vertices of E(ni) and 
t4(H2) directed by Vlflext) are {1,3}, so 

F(A) = y(5)U{l,3} = {4,5,l,3} 


and 


E(5) = {4„5„1«,3J. 


According to Eqns. (18l-(21), the edges of the overlapped subdomain A(Hi) 
are defined based on their projection onto the original graph. The first diagonal 
block of the overlapped matrix is formed by picking the E(iii) = {1,2,4, 5} 
rows and columns of the original matrix 



1 

2 

4 

5 

1 


o 

o 

o} 

2 

o 

o 



4 

o 


o 


5 




oy 


Similarly for th e ot her two diagonal blocks, and this is shown in Figure [^b). 

From Eqn. (201, we can construct the edges from Cti to S. These are the 
nonzero entries of the overlapped interface block Fi , adopting the same notation 
as in We pick the E(lli) = {1,2,4, 5} rows and V{S) = {4,5,1,3} columns 
of the original matrix, and we set the columns corresponding to the common 
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Oil ai2 


014 Ol5 

0-21 O 22 




033 

O 34 O 35 

041 

O 43 

O44 

<251 

O 53 

O 55 


(a) The original matrix 


Oil 

O 12 

Ol4 

021 

O 22 


041 


O 44 

O 51 




0-12 


015 


033 

043 

053 


034 

044 


035 


043 

053 


055 


041 

051 


O44 


O4I 

O43 


O55 

O51 

O53 

Ol4 

Ois 

Oil 


O34 

O35 


O33 . 


(b) The matrix after one-level overlapping 


Figure 2: Matrix structure before and after applying the overlapping procedure. 


vertexes of and S to zeros. In our example this results in zeroing out the 
columns of Fi indexed by U ^ = {4, 5,1}, giving 



4 

5 

1 

3 


4 5 1 

3 

1 


X 

0 


1 

f 

\ 

2 



0 


2 



4 

'A' 


X 

X 

^ 4 


X 

5 

1 

X 

X 

xj 

5 

[ 

X/ 


Similar procedure is followed for the other blocks Fi, Ei. Finally, the overlapped 
matrix has the form given in Figure [^b). The block arrow structure of the 
original matrix is preserved. However, symmetry is lost and the sparsity pattern 
also changes significantly. 


3.3 Analysis 

It is interesting to analyse the effect that overlapping may produce on 


the AMES method. 


According to (15) and (18), the size and the number 

According to 


of nonzeros in each subgraph is increased after overlapping. 

(20), the interconnections between subdomains and separator are reduced in 
the overlapped system, as the original interconnectivities are all removed. 
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The more nodes are added to the subgraphs, the richer they are in terms of 
information about the system matrix, and a larger performance improvement 
may be expected. In the overlapped system, each block Bi has the following 
structure 


vm 


1 /( 0 ,) 

B 


£/( 0 ^, ^i^ext') 

Biyii^ext) 


From Eqn. (13) we see that the set of neighboring nodes V{D.i^f,xt) 


corresponds to the nonzero columns of the block F), and the nonzero elements 
of Fi are determined by the set of interconnections E{U,i, fliext)- Therefore, the 
more dense and larger the blocks Fi,i = 1 : p, (that is, the size of separator) 
in the original matrix, the more nodes and interconnections are added to 
subdomains, and a larger reduction of the number of iterations can be achieved. 


3.4 Algorithmics 

The AMES preconditioning algorithm described in Section with one extra 
overlapping phase writes as follows: 

1. a scale phase, where the matrix A is scaled by rows and columns so that 
the largest entry of the scaled matrix has magnitude smaller than one; 

2. a preorder phase, where the structure of A is used to compute a suitable 
ordering that can maximize sparsity in the approximate inverse factors; 

3. an overlap phase, which extends each block Bi and the Schur complement, 
and generates the overlapped matrix A and the right-hand side vector 6; 

4. an analysis phase, where the sparsity preserving and overlapping orderings 
are analyzed and an efficient data structure is generated for the 
factorization; 

5. a factorization phase, where the entries of A are processed to explicitly 
compute the approximate inverse factors; 

6 . a solve phase, that accesses all the data structures for solving the 
overlapped linear system. 

7. a restriction phase, that restricts the solution obtained from the 
overlapped system to the original system, and obtains the solution. 
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4 Numerical experiments 

In this section we present the results of our numerical experiments to illustrate 
the performance of the AMES preconditioner, also against other state-of-the-art 
methods and software for solving general linear systems. The selected matrix 
problems are extracted from the public-domain matrix repository available at 
the University of Florida [T^, and arise from various application fields. We 
present a summary of the characteristics of each linear system in Table 
We applied AMES as a preconditioner for the Generalized Minimal Residual 


Matrix problem 

n 

Field 

nnz{A) 

orsirr_l 

1,030 

Oil reservoir simulation 

6,858 

1138_bus 

1,138 

Bus Power System 

4,054 

bcsstk27 

1,224 

BCS Structural Engineering Matrix 

28,675 

epbO 

1,794 

Plate-fin heat exchanger 

7,764 

CZ20468 

20,468 

Glosest Point Method 

206,076 

raefsky3 

21,200 

Fluid Structure Interaction 

1,488,768 

ABACUS_shell_ud 

23,412 

ABAQUS benchmark 

218,484 

sme3Db 

29,067 

3D structural mechanics problem 

2,081,063 

viscoplastic2 

32,769 

FEM discretization 

381,326 

CZ40948 

40,948 

Closest Point Method 

412,148 

rmalO 

46,835 

3D CFD Model 

2,374,001 

finan512 

74,752 

Portfolio optim 

596,992 

helm2d03 

392,257 

Helmholtz eq. on a unit square 

2,741,935 

parabolic fem 

525,825 

Parabolic FEM 

3,674,625 


Table 1: Set and characteristics of the test matrix problems. 


(GMRES) method by Saad and Schultz [12]. In all our runs we started the 
iterative solution from the zero vector and we stopped it when either the initial 
residual was reduced by twelve orders of magnitude or when no convergence 
was achieved after 5000 matrix-vector products. To limit memory costs, we 
restarted the GMRES algorithm every 500 iterations. The right-hand side b of 
the linear system was chosen so that the solution is the vector of all ones, that is 
b = Ae with e = [1,..., 1]^. In each run we recorded the following performance 
measures: 

1 . the density ratio i that is the ratio between the number of 

nonzeros in the preconditioner matrix M = Mu Ml versus the number of 
nonzeros in the coefficient matrix A; 

2. the number of iterations Its required to reduce the initial residual by 12 
orders of magnitude starting from the zero vector; 

3. the GPU time cost in seconds for completing the preorder phase (denoted 
by tp), for constructing the approximate inverse factorization (t/), and for 
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solving the linear system (tg). Symbol means that the corresponding 
phase does not apply to the given run. For example, some of the 
preconditioners used for the comparison against our method do not have 
a preorder phase. 

The codes were developed in Fortran 95 and the experiments were run in 
double precision floating point arithmetic on a PC equipped with an Intel(R) 
Core(TM)2 Duo CPU E8400, 3.00 GHz of frequency, 4 GB of RAM and 6144 KB 
of cache memory. In the coming sections we study the effect of using different 
parameter settings, and we illustrate the overall performance on the selected 
matrix problems. 


4.1 Performance of the multilevel mechanism 

The AMES method can be seen as a multilevel generalization of factorized 
approximate inverse techniques such as the FSAI preconditioner by Kolotilina 
and Yeremin, and the AINV preconditioner by Benzi and Tuma, that we 
recalled in Section Therefore, first we present some comparison between 
these methods, to show the benefit of the multilevel mechanism. The results 
are reported in Table[^ For these runs, we considered four matrix problems from 
Tablej^ that are orsirr_l, 1138-bus, bcsstk27 and epbO. In our AMES solver, we 
inverted the last level blocks using ILU, FSAI and AINV factorizations. For ILU, 
we used the multilevel implementation available in the ILUPACK package [5] 
(this combination is referred to as AMESJLU in the table). For FSAI, we 
used the structure of the nonzero pattern of the lower (resp. upper) triangular 
part of the symmetrized block for the approximate inverse factors, and also 
the square of this pattern (this combination is referred to as AMES-FSAI). 
Finally, for AINV we used the implementation kindly provided by the authors 
(this combination is referred to as AMES-AINV). The dropping threshold value 
selected for the AINV, AMES-AINV and AMES-ILU methods (referred to as 
Drop in the Table) is an absolute value, and was computed so that the resulting 
preconditioners had roughly equal memory cost. We used the default value for 
the parameter condest = 10 (norm bound for the inverse factors L~^ and U~^) 
in ILUPACK. 

In our runs, the multilevel variants AMES-ESAI and AMES-AINV 
performed consistently better than the FSAI and AINV solvers in terms of 
convergence rate and storage cost. This is probably due to the multilevel 
mechanism that enabled us to exploit sparsity in the inverse factors more 
effectively. The best solutions with AMES were obtained using ILU as local 
solver, while the threshold-based dropping rules of the AINV method often 
computed a better pattern for the approximate inverse factors than the static 
approach used in the FSAI method. We can see evidence of this behaviour in 
Figures |^- where for one of the last-level blocks of the permuted coefficient 
matrix ® we compare the structure of its exact inverse factor L~^, and of 
the approximations Ml and of L~^ as computed by, respectively, the 
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AMES.FSAI code using the square of the pattern of the symmetrized block, 
and by the AMES-AINV code. Large to small entries are depicted in different 
colors, from red to orange and yellow. The approximation is good for the 
1138-bus problem (Figure]^ but poor for the orsirrA matrix (Figure]^, and 
this is confirmed by the different convergence results for the two problems, 
reported in Table On some larger problems, like the cz40948 and the 
ABACUS-shelLud problems, shown in Figures i-i we found that L ^ had 
no evident structure; in this case we had to increase the number of nonzeros in 
Ml and fF^ signihcantly to converge. For example on the ABACUSshelLud 
problem, AMES-AINV converged in 468 iteration with =11.6 while 

AMES-FSAI did not converge at this value of density. In these situations, 
uniformly better convergence were obtained using ILU as local solver. We will 
focus mostly on this choice of local solver for the experiments of this paper. 
Notice that in this case the entries of the inverse factors are not computed 
explicitly, and the application of the preconditioner is carried out through a 
backward and forward substituion procedure. Other options may be considered 
for the last level solver, such as the ARMS method [H] and enhanced FSAI 
methods m, but these are not included in the presented analysis. 


(a) orsirr.l 


Method 

Pattern Drop/condest 

Its 

nnz{ML+Mu) 

nnziA) 

tp 

tf 

ts 

AMES-FSAI 

A + A^ 

273 

1.42 

0.011 

0.023 

0.22 

FSAI 

304 

1.45 

- 

0.070 

0.23 

AMES-FSAI 

{A + A^f 

217 

3.43 

0.013 

0.035 

0.17 

FSAI 

236 

3.76 


0.088 

0.16 

AMES-AINV 

0.03 

67 

2.27 

0.016 

0.014 

0.034 

AINV 

0.07 

80 

2.22 

- 

0.016 

0.024 

AMESJLU 

8e-3/10 

31 

1.24 

0.012 

0.013 

7.4e-3 


(b) 

1138_bus 





Method 

Pattern Drop/condest 

Its 

nnz{M L,-\- Mu) 
nnziA) 

tp 

tf 

ts 

AMES-FSAI 

FSAI 

AjA^ 

7 

9 

2.24 

2.32 

5.2e-3 

0.032 

0.074 

1.2e-3 

8.9e-4 

AMES-FSAI 

[A + A^f 

5 

2.70 

5.0e-3 

0.035 

l.Oe-3 

FSAI 

6 

2.88 


0.077 

6.4e-4 

AMES-AINV 

0.6 

13 

2.85 

7.0e-3 

2.0e-3 

1.9e-3 

AINV 

0.7 

16 

2.88 

- 

6.0e-3 

3.2e-3 

AMESJLU 

0/10 

1 

1.00 

5.1e-3 

3.9e-3 

7.0e-4 
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(c) bcsstk27 


Method 

Pattern 

Drop/condest 

Its 

nnz{ML-\-Mu) 

nnz{A) 

tp 

tf 

ts 

AMES.FSAI 



8 

0.90 

0.062 

0.041 

0.021 

FSAI 

A + A^ 


19 

1.27 


0.20 

4. le-3 

AMES-FSAI 



5 

1.16 

0.063 

0.071 

0.018 

FSAI 

{A + A^y 

" 

13 

2.72 


0.47 

3.7e-3 

AMES^AINV 


le-3 

6 

1.18 

0.055 

0.040 

7.3e-3 

AINV 


0.06 

16 

0.98 

- 

0.063 

5.7e-3 

AMESJLU 

- 

0 .01/10 

6 

0.978 

0.059 

0.016 

0.010 



(d) 

epbO 





Method 

Pattern 

Drop/condest 

Its 

nnz{M L-\- Mu) 
nnz{A) 

tp 


ts 

AMES.FSAI 

A + A^ 


277 

1.67 

0.020 

0.011 

0.66 

FSAI 


400 

1.69 

- 

0.19 

0.59 

AMES.FSAI 



161 

4.32 

0.021 

0.023 

0.40 

FSAI 

(A + A^y 


290 

4.81 


0.23 

0.27 

AMES^AINV 


0.5 

132 

3.32 

0.024 

4.5e-3 

0.21 

AINV 


0.9 

347 

4.26 

- 

0.015 

0.42 

AMESJLU 

- 

0 .1/10 

7 

1.848 

0.020 

4. le-3 

0.019 


Table 2: Numerical experiments on selected matrix problems illustrating the 
performance of the multilevel sparse approximate inverse preconditioner AMES 
against the factorized approximate inverse methods FSAI and AINV. 
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Figure 3: The exact (left) and approximate (middle and right) inverse lower 
triangular factors of the 1138-bus matrix. 


4.2 Varying the number of independent clusters at the 
first level 

We considered three matrix problems in our runs: cz20468, ABACUSshelLud 
and CZ 4 O 948 . In Table we show the results varying the number of independent 
clusters p at the first level of reordering of 4 in Q. For each problem, we used 
the same number of levels in the AMES structure, and tuned the drop 
tolerance in the local ILU factorization to keep the memory ratio 
roughly constant while increasing p in different runs. Clearly, larger p results 
in more independent clusters of smaller size, and in larger Schur complement 
matrices. In the table we report the ratio between the average size of 

the independent clusters Bi and the size of the Schur complement at the first 
level. Increasing p reduces in turn to values smaller than 1. Using ILU as 

local solver, the best convergence results were obtained when « 1. Our 

experiments indicate that for good performance the size of each independent 
cluster should be approximately equal to that of the Schur complement. 
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Figure 4: The exact (left) and approximate (middle and right) inverse lower 
triangular factors of the orsirr-1 matrix. 


4.3 Varying the number of reduction levels for the 
diagonal blocks 

We consider again matrices cz40948, ABACUSshelLud and cz20468 for our 
tests. We varied the number of levels niev from 1 to 3 in the multilevel reordering 
of the diagonal blocks. In each run we tuned the dropping threshold parameter 
to have roughly the same memory cost in the experiments for each matrix. 
We chose the value of p for each problem so that « 1 as reported in 

Section 5.2. The last level blocks were factorized using ILUPACK [ 3 . The 
results reported in Table show that using more levels can reduce the number 
of iterations for similar memory ratio as we can gain additional sparsity during 
the factorization. However, probably due to our non optimized implementation, 
the solution cost tends to increase. From our experiments, a small number of 
reduction levels is recommended to use. 
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Figure 5: The exact (left) and approximate (middle and right) inverse lower 
triangular factors of the cz40948 matrix. 


4.4 Varying the number of reduction levels for the Schur 
complement 

The Schur complement matrix relative to the block C in typically preserves 
a good deal of sparsity, and this can be further exploited during the factorization 
by applying, e.g., the multilevel nested dissection reordering to As, similarly to 
what is done to the upper leftmost block B. We implemented this idea at the 
first permutation level, using ILU factorization as local solver and selecting the 
same values of p and niev for each matrix problem. We tuned the drop tolerence 
in the ILU factorization to have roughly the same memory costs in different runs. 
We varied nievAS from 0 to 3 (jiievAS = 0 means that only the diagonal blocks 
of the upper-left block B are permuted). The results reported in Tableshow 
that the simultaneous permutation of both the diagonal blocks of B and of the 
Schur complement can make the preconditioner more robust. We adopted this 
strategy in the experiments illustrated in the coming sections, selecting in each 
run the value of nig^As that minimized the total solution cost. 
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Figure 6: The exact (left) and approximate (middle and right) inverse lower 
triangular factors of the ABACUS.shelLud matrix. 


4.5 Comparison against other solvers 

We compared the performance of the AMES preconditioner against other 
three popular algebraic preconditioners for solving linear systems, that are the 
ILUPACK solver developed by Bollhofer and Saad [5], the Algebraic Recursive 
Multilevel Method (ARMS) proposed by Saad and Suchomel [44], and the 
SParse Approximate Inverse preconditioner (SPAI) introduced by Grote and 
Ruckle [H] . As in the previous experiments, for each run we recorded the CPU 
time from the start of the solution until the initial residual was reduced by 
12 orders of magnitude or until the process failed. We declared a solver failure 
when no convergence was achieved after 5000 iterations of the restarted GMRES 
method. We selected the parameters carefully to have a fair comparison between 
different methods. In AMES, following our conclusions from Section |4.2[ we 
selected the number of blocks Bi at the first level so that their average size 
is almost equal to the size of the Schur complement. For every problem we 
tested different combinations of number of levels nigy of recursive factorization 
and different values for the dropping threshold parameter droptol for the 
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Matrix 

P 

Its 

tp 


ts 

^pev-it 

sizeB 

sizeAs 


15 

151 

0.4 

0.3 

3.0 

0.020 

4.3 


20 

139 

0.4 

0.3 

2.8 

0.020 

2.5 



30 

131 

0.5 

0.4 

2.5 

0.019 

1.1 


40 

137 

0.5 

0.6 

2.6 

0.019 

0.6 


4 

258 

0.5 

1.2 

9.1 

0.035 

7.8 


6 

242 

0.5 

1.2 

8.3 

0.034 

3.8 

ABACUS shell ud 


12 

213 

0.5 

2.0 

6.8 

0.032 

1.0 


15 

253 

0.5 

2.1 

8.7 

0.034 

0.7 


15 

219 

1.1 

0.5 

10.8 

0.049 

8.7 

CZ40948 

30 

212 

1.1 

0.7 

10.0 

0.047 

2.2 

45 

198 

1.1 

1.2 

9.2 

0.046 

0.9 


50 

207 

1.1 

1.9 

9.8 

0.047 

0.5 


Table 3: The best performance of the multilevel sparse approximate inverse 
preconditioner are observed when ^ 


Matrix 

'^lev 

Its 

tp 


ts 

^perjit 


1 

113 

0.5 

0.4 

1.9 

0.017 

CZ20468 

2 

80 

0.5 

0.5 

2.3 

0.028 


3 

71 

0.6 

0.5 

4.1 

0.058 


1 

388 

0.5 

1.7 

17.6 

0.045 

ABACUS_shell_ud 

2 

381 

0.5 

1.9 

21.9 

0.057 


3 

294 

0.6 

1.9 

22.7 

0.077 


1 

198 

1.1 

1.2 

9.2 

0.046 

CZ40948 

2 

154 

1.2 

1.3 

10.4 

0.068 


3 

133 

1.3 

1.3 

17.3 

0.13 


Table 4: The number of iterations of the multilevel approximate inverse 
preconditioner can be reduced by increasing the number of reduction levels 
niev for the diagonal blocks, at roughly equal memory costs. 


factorization of the last level blocks Bi and of the Schur complements. We chose 
the best combination in terms of memory and time to solution costs for the given 
problem. Then we tuned the value of the dropping threshold in the ILUPACK, 
ARMS, SPAI and AINV solvers to have roughly equal memory costs as in 
AMES, setting the other parameters equal to their default values defined in those 
packages. The performance of these methods is rather sensitive to the dropping 
threshold parameter. For example, on the rmalO problem, ILUPACK converged 
in only 9 iterations using the default value droptol = 0.01, but the computation 
costed = 8.9 and tj = 45s; ARMS converged in 26 iterations with the 
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Matrix 

nievAS 

Its 

tp 

tf 

ts 


0 

331 

0.5 

0.2 

8.2 

CZ20468 

1 

228 

0.4 

1.3 

5.6 

2 

209 

0.4 

1.3 

4.8 


3 

181 

0.4 

1.3 

4.0 


0 

576 

0.5 

1.8 

35.0 

ABACUS_shell_ud 

1 

485 

0.5 

1.8 

29.5 

2 

414 

0.5 

1.4 

24.4 


3 

393 

0.5 

1.6 

22.2 


0 

183 

1.9 

0.5 

23.7 

CZ40948 

1 

166 

1.9 

6.4 

16.6 

2 

157 

1.9 

6.1 

14.8 


3 

152 

1.8 

6.1 

14.3 


Table 5: At roughly equal memory costs, larger reduction levels for the Schur 
complement can improve the convergence rate. 


default droptol = 0.001, costing = 33.9 and tf = 1111s; and SPAI 

could not converge in 5000 iterations with = 0.19, using the default 

value droptol = 0.6. The number of levels of recursive factorization in the 
multilevel methods ILUPACK and ARMS are calculated automatically by the 
original codes developed by their authors. We point out that the performance 
comparison between AMES and the other solvers at fixed memory occupation 
may be a little penalizing for the AINV, FSAI and SPAI preconditioners as one- 
level approximate inverses inherently need more memory; the ARMS method is a 
multilevel solver, but it factorizes the diagonal blocks without any permutation. 

In Table we show the complete results of our experiments. These include 
number of iterations (Its), density ratio costs for the 

preordering (tp), factorization {tf) and solve phase {tg). We also tested the 
unpreconditioned GMRES for these matrices problems, and no convergence 
is achieved. We clearly see the good potential of the multilevel mechanism 
incorporated in the AMES preconditioner to reduce the number of iterations 
of Krylov methods, also in comparison to other multilevel solvers at low to 
moderate memory costs. In our examples, AMES was more robust than these 
solvers especially at low memory ratios. 

4.6 Effect of overlapping 

We solved several problems from Table combining the AMES method with 
overlapping after the first level of reordering in §. In these runs, we set niev = 
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(a) cz20468 


Method 

rLnz{ML-\-Mu) 

nnz(A) 

Its 

tp 

b 

ts 

AMES 

1.26 

187 

0.3 

0.2 

4.2 

ILUPACK 

1.24 

2500 

- 

0.4 

40.3 

ARMS 

1.16 

+5000 

- 

0.1 

+6.5 

SPAI 

1.64 

+5000 

- 

4.0 

+8.0 


(b) r 

■aefskyS 




Method 

nnz{ML-\-Mij) 

nnz(A) 

Its 

tp 


ts 

AMES 

0.54 

235 

2.4 

3.7 

10.0 

ILUPACK 

0.55 

1224 

- 

2.8 

25.2 

ARMS 

2.38 

+5000 

- 

2.4 

+23.5 

SPAI 

1.83 

+5000 

- 

5040 

+243.0 


(c) ABACUS_shelLud 




Method 

nnz{M lAMjj ) 
nnz(A) 

Its 

tp 

tf 

ts 

AMES 

1.79 

453 

0.3 

0.8 

22.1 

ILUPACK 

1.82 

1411 

- 

0.5 

26.6 

ARMS 

1.88 

+5000 

- 

0.2 

+7.6 

SPAI 

2.41 

+5000 

- 

11.0 

+12.0 


(d) smeSDb 




Method 

nnz{ML-\-Mi;) 

nnz{A) 

Its 

tp 

b 

ts 

AMES 

0.85 

407 

3.5 

8.4 

39.3 

ILUPACK 

0.74 

1210 

- 

4.1 

41.4 

ARMS 

5.61 

+5000 

- 

39.0 

+54.9 

SPAI 

1.23 

+5000 

- 

3360 

+123.0 


2 , and we tuned the droptol parameter to have roughly the same memory costs 
in the experiments with and witout overlapping. In the last two columns of 
Table we give the effect of overlapping on the change in size and in number of 
nonzeros for the overlapped system. The number of iteration {Its) are almost the 
same after overlapping for problems cz20468 and cz40948, while for problems 
smeSDb, ABACUSshelLud and raefskyS we observed a consistent reduction of 
the number of iterations Its by a factor between 9.5% and 23.8% and of the 
solving time ts by a factor between 21.4% and 29.9%. This is in agreement 
with our analysis of Section In Table for each problem we studied the 
sparsity pattern of block F and the size of blocks B and C before and after 
overlapping is applied at the first reordering level. The quantity Spp denotes 
the ratio between the number of nonzero elements and the size of F, that is 
the sparsity degree . As we can see, the cz20468 and cz40948 problems 

have the smallest relative size of the separator C and also the smallest value of 
SpF] this means that less information is added to the subdomains. Following 
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(e) viscoplastic2 


Method nu.(Mr. + Mu) 

nnz{A) 

Its 

tp 


ts 

AMES 

3.07 

78 

0.9 

14.3 

3.9 

ILUPACK 

4.00 

2500 

- 

1.6 

70.0 

ARMS 

3.02 

+5000 

- 

0.9 

+10.9 

SPAI 

3.37 

+5000 

- 

244.0 

+24.0 


(f) CZ40948 




Method 

nnz(A) 

Its 

tp 


ts 

AMES 

1.41 

170 

0.7 

0.4 

7.4 

ILUPACK 

1.48 

1627 

- 

1.0 

51.1 

ARMS 

1.70 

+5000 

- 

0.9 

+21.8 

SPAI 

1.64 

+5000 

- 

8.5 

+17.2 


(g) rmalO 




Method 

Its 

tp 

if 

ts 

AMES 

2.33 

164 

3.9 

13.1 

34.5 

ILUPACK 

2.27 

1242 

- 

8.6 

82.9 

ARMS 

14.30 

+5000 

- 

203.9 

+111.3 

SPAI 

4.84 

+5000 

- 

11280 

+180 


(h) finan512 




Method 

nnz{ML+Mu) 

nnz{A) 

tp 


ts 

AMES 

0.59 

9 

0.8 

0.5 

0.8 

ILUPACK 

0.62 

11 

- 

0.7 

0.1 

ARMS 

0.58 

36 

- 

0.4 

0.5 

SPAI 

0.61 

7 

- 

4.2 

0.2 
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(i) helm2d03 


Method 

nnz{ML-\-Mu) 

nnz(A) 

Its 

tp 

if 

ts 

AMES 

0.88 

6 

6.1 

4.3 

4.6 

ILUPACK 

0.91 

7 

- 

3.7 

0.4 

ARMS 

0.93 

12 

- 

1.4 

1.5 

SPAI 

0.87 

15 

- 

100.7 

2.7 


(j) parabolic_fem 




Method 

nnz[ML -\-Mu) 
nnz(A) 

Its 

tp 

if 

ts 

AMES 

0.75 

4 

4.7 

5.7 

1.3 

ILUPACK 

0.68 

10 

- 

5.3 

0.5 

ARMS 

0.76 

12 

- 

2.0 

2.0 

SPAI 

0.77 

4 

- 

175.3 

0.8 


Table 6: Performance comparison of the multilevel approximate inverse 
preconditioner against other iterative solvers, both one-level and multilevel. 


the analysis reported in Section the overlapping technique is less likely to 
help on these two matrices, and this is also conhrmed by the numerical results. 
Differently, problems smeSDb and raefskyS show larger values of sizec and Spp 
and in fact overlapping has a better effect on convergence for these two problems. 
In our experiments we found that a small number of independent clusters p is 
recommended to use when overlapping. 

4.7 Utilizing direct solvers in the AMES framework 

The results of previous sections indicate that the multilevel mechanism can 
be effective to reduce the memory burden but, at least in our implementation, 
tends to increase the cost per iteration. As an attempt of a possible remedy, we 
performed some runs setting the dropping threshold parameter droptol equal 
to zero, and using a sparse direct solver, namely the routine MA38 from the 
HSL Mathematical Software Library [22], as a local solver. No approximation 
is introduced and the Schur complements are exact. Therefore in each problem 
we can obtain convergence in one or two iterations, and the solving phase is 
much cheaper. This can be observed in Table on selected matrix problems. 
Comparing against the results with inexact inversion, we see that using a direct 
solver as local component can save computational time at only moderate extra 
storage cost. 
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Matrix 

Method 

nnz{ML -\-Mu) 

Its 

ts 

over lapped ) 

Z A Quer lapped ) 

nnz{A) 

n(A) 

nnz(A) 

CZ20468 

overlapping 

1.33 

147 

4.3 

1.005 

1.004 

without overlapping 

1.32 

149 

4.5 

- 

- 

raefsky3 

overlapping 
without overlapping 

0.56 

0.56 

218 

286 

14.3 

20.3 

1.134 

1.135 

ABACUS_shell_ud 

overlapping 
without overlapping 

3.06 

3.03 

238 

263 

13.6 

17.4 

1.020 

1.019 

sme3Db 

overlapping 

0.91 

389 

49.1 

1.588 

1.639 

without overlapping 

0.91 

495 

62.5 

- 

- 

CZ40948 

overlapping 

1.42 

175 

12.0 

1.006 

1.004 

without overlapping 

1.40 

172 

17.5 

- 

- 


Table 7: Experiments on the effect of block overlapping on the performance of 
the multilevel sparse approximate inverse. 


Matrix problem 

Method 

sizes 

sizcc 

SpF 

CZ20468 

original 

20405 

63 

l.le-4 

after overlapping 

20450 

116 

4.3e - 5 

raefsky3 

original 

19776 

1424 

l.le-4 

after overlapping 

21184 

2864 

5.1e-4 

ABACUS_shell_ud 

original 

23184 

228 

1.3e-4 


after overlapping 

23412 

458 

6.1e-5 

sme3Db 

original 

19956 

9111 

9.2e-4 

after overlapping 

25932 

20214 

3.4e - 4 

CZ40948 

original 

40825 

123 

5.2e-5 

after overlapping 

40925 

250 

2.4e - 5 


Table 8: Effect of overlapping on matrix blocks size and sparsity. 


Matrix 

nnz{Mi:,-\-Mi/) 

nnz(A) 

Its 

tp 

if 

ts 

CZ20468 

1.28 

2 

0.7 

0.4 

1.5 

raefsky3 

2.74 

1 

3.4 

11.1 

1.3 

CZ40948 

1.87 

2 

1.2 

0.3 

0.2 

rmalO 

3.01 

1 

5.2 

11.6 

0.8 


Table 9: Using the AMES factorization as a direct solver. 
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5 Conclusions 


In this paper a recursive multilevel implementation of factorized sparse 
approximate inverse preconditioners for Krylov subspace methods was discussed. 
We used recursive combinatorial techniques and overlapping strategies as an 
attempt to remedy two typical drawbacks of explicit preconditioning, that are 
lack of robustness and high construction cost. The numerical experiments show 
that these strategies can improve the performance of conventional approximate 
inverse methods, yielding iterative solutions that can compete favourably 
against other popular solvers in use today. Parallelism can be exploited at 
various levels in our method, alongside other code optimization. Fine-grained 
blocking, filtering, postfiltering, adaptive pattern selection strategies have been 
shown to be promising approaches in other contexts [T^ [13 [HI [HI [TT] . 
and these can be considered also in our setting. In a distributed memory 
implementation, it will be natural to split the oct-tree by assigning the local 
problems to different processors. An efficient use of recursive combinatorial 
algorithms may reduce considerably the size of the Schur complements, hence 
the amount of inter-node communications. Memory demands, an important 
bottleneck of modern algorithms, are also limited, but this does not penalize 
much the overall numerical efficiency of the solver, as illustrated by the 
experiments of Tables and Overlapping does not destroy the sparsity 
structure of the matrix and can reduce further the interconnections between 
subdomains and separator set. Hence it is worthwhile considering it in a parallel 
setting as well. However, the parallel implementation of a fully distributed Schur 
complement formulation may not be trivial and will be considered in a separate 
study. 
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